Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nsexec: cloned binary rework #3987

Merged
merged 8 commits into from
Sep 24, 2023
Merged

nsexec: cloned binary rework #3987

merged 8 commits into from
Sep 24, 2023

Conversation

cyphar
Copy link
Member

@cyphar cyphar commented Aug 18, 2023

Split out from #3953 to be reviewed separately.


First, Migrate all of the memfd /proc/self/exe logic to Go code.
This allow us to remove the amount of C code in runc quite
substantially, as well as removing a whole execve(2) from the nsexec
path because we no longer spawn "runc init" only to re-exec "runc init"
after doing the clone.

Then, carry the code from #3983 and combine it with the above to
keep all of the runc-dmz logic in the Go code as well. The runc-dmz
feature is optional (controlled by the runc_nodmz build tag at
build time, or RUNC_DMZ=legacy at runtime), with a fallback to the
classic /proc/self/exe cloning. If /proc/self/exe is already a
cloned binary then neither runc-dmz nor /proc/self/exe are
cloned.

This also adds a memfd-bind helper binary for users which really
need to avoid any excess copies. Unfortunately, the semantics of
bind-mounting a memfd are quite ugly (they cannot be bind-mounted
directly -- instead, you must bind-mount the magic-link of a file
descriptor pointing to the memfd, which means you need a long-running
process to keep the magic-link alive).

Fixes #3965
Fixes #3973
Signed-off-by: Aleksa Sarai [email protected]

@cyphar cyphar added this to the 1.2.0 milestone Aug 18, 2023
@cyphar cyphar marked this pull request as ready for review August 20, 2023 13:58
README.md Outdated Show resolved Hide resolved
}
path := "/proc/self/fd/" + strconv.Itoa(int(f.Fd()))
if err := unix.Access(path, unix.X_OK); err == nil {
return true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The binary might be still unexecutable depending on LSM?
Maybe we should just execute the binary and see whether it returns a magic ret code?

Copy link
Member Author

@cyphar cyphar Aug 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could, though doing a maximum of ~10 extra fork+execs in the absolute worst case seems a bit expensive... This also can't detect every LSM issue (it's possible executing a host /tmp program will be blocked by the container's LSM setup -- and we cannot detect that this early into container setup).

I think the only true solution for a really serious LSM setup is to disable runc-dmz entirely, or we'll have to create a tmpfs mountfd that allows executables, which would add more mount_lock lock contention again....

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the new RUNC_DMZ=legacy option means that users can work around this if they hit an issue here. I don't think there's a nicer way of dealing with this problem completely tbh.

Makefile Outdated Show resolved Hide resolved
Makefile Show resolved Hide resolved
@cyphar cyphar requested a review from lifubang August 23, 2023 07:10
@cyphar cyphar dismissed lifubang’s stale review August 23, 2023 08:07

switched to LANG=C.UTF-8

Makefile Outdated Show resolved Hide resolved
libcontainer/dmz/dmz.go Outdated Show resolved Hide resolved
@cyphar cyphar mentioned this pull request Mar 14, 2024
@kolyshkin kolyshkin mentioned this pull request Oct 11, 2024
lifubang added a commit to lifubang/runc that referenced this pull request Oct 15, 2024
we need (opencontainers#4020) because of (opencontainers#3931), at that time, we removed the bindfd logic, and the
memfd logic will use more memory than before, but we have not yet moved binary clone from
runc init to runc parent process, so we need to increase memory limit in CI.
As we have moved the runc binary clone logic from runc init to runc parent process in
(opencontainers#3987), so the memory usage of binary clone will not be included in container's memory
cgroup accounting. Now we can support run a simple container with lower memory usage the
same as before.

Signed-off-by: lifubang <[email protected]>
@lifubang lifubang mentioned this pull request Oct 15, 2024
lifubang added a commit to lifubang/runc that referenced this pull request Oct 15, 2024
This reverts commit 65a1074.

we need (opencontainers#4020) because of (opencontainers#3931), at that time, we removed the bindfd logic, and the
memfd logic will use more memory than before, but we have not yet moved binary clone from
runc init to runc parent process, so we need to increase memory limit in CI.
As we have moved the runc binary clone logic from runc init to runc parent process in
(opencontainers#3987), so the memory usage of binary clone will not be included in container's memory
cgroup accounting. Now we can support run a simple container with lower memory usage the
same as before.
lifubang added a commit to lifubang/runc that referenced this pull request Oct 15, 2024
This reverts commit 65a1074.

we need (opencontainers#4020) because of (opencontainers#3931), at that time, we removed the bindfd logic, and the
memfd logic will use more memory than before, but we have not yet moved binary clone from
runc init to runc parent process, so we need to increase memory limit in CI.
As we have moved the runc binary clone logic from runc init to runc parent process in
(opencontainers#3987), so the memory usage of binary clone will not be included in container's memory
cgroup accounting. Now we can support run a simple container with lower memory usage the
same as before.

Signed-off-by: lifubang [email protected]
lifubang added a commit to lifubang/runc that referenced this pull request Oct 15, 2024
This reverts commit 65a1074.

we need (opencontainers#4020) because of (opencontainers#3931), at that time, we removed the bindfd logic, and the
memfd logic will use more memory than before, but we have not yet moved binary clone from
runc init to runc parent process, so we need to increase memory limit in CI.
As we have moved the runc binary clone logic from runc init to runc parent process in
(opencontainers#3987), so the memory usage of binary clone will not be included in container's memory
cgroup accounting. Now we can support run a simple container with lower memory usage the
same as before.

Signed-off-by: lifubang [email protected]
Signed-off-by: lfbzhm <[email protected]>
lifubang added a commit to lifubang/runc that referenced this pull request Oct 15, 2024
This reverts commit 65a1074.

we need (opencontainers#4020) because of (opencontainers#3931), at that time, we removed the bindfd logic, and the
memfd logic will use more memory than before, but we have not yet moved binary clone from
runc init to runc parent process, so we need to increase memory limit in CI.
As we have moved the runc binary clone logic from runc init to runc parent process in
(opencontainers#3987), so the memory usage of binary clone will not be included in container's memory
cgroup accounting. Now we can support run a simple container with lower memory usage the
same as before.

Signed-off-by: lfbzhm <[email protected]>
lifubang added a commit to lifubang/runc that referenced this pull request Oct 17, 2024
In opencontainers#3987(0e9a335), we may use a memfd to copy run to start runc init,
due to a Go stdlib bug, we need to add safeExe to the set of
ExtraFiles otherwise it is possible for the stdlib to clobber the fd
during forkAndExecInChild1 and replace it with some other file that
might be malicious. This is less than ideal (because the descriptor
will be non-O_CLOEXEC) however we have protections in "runc init" to
stop us from leaking extra file descriptors.
See <golang/go#61751>.

But because of we have added safeExe to the set of ExtraFiles, if the
fd of safeExe is too small, go stdlib will dup3 it to another fd, then
it will cause the original fd closed. (opencontainers#4294)

Signed-off-by: lfbzhm <[email protected]>
lifubang added a commit to lifubang/runc that referenced this pull request Oct 17, 2024
In opencontainers#3987(0e9a335), we may use a memfd to copy run to start runc init,
due to a Go stdlib bug, we need to add safeExe to the set of
ExtraFiles otherwise it is possible for the stdlib to clobber the fd
during forkAndExecInChild1 and replace it with some other file that
might be malicious. This is less than ideal (because the descriptor
will be non-O_CLOEXEC) however we have protections in "runc init" to
stop us from leaking extra file descriptors.
See <golang/go#61751>.

There is a race situation when we are opening this memfd, if the fd 5
or 6 was closed at that time, maybe it will be reused by memfd.

But because of we have added safeExe to the set of ExtraFiles, if the
fd of safeExe is than stdio fds count + ExtraFiles count, go stdlib will
dup3 it to another fd, then it will cause the original fd closed. (opencontainers#4294)

Signed-off-by: lfbzhm <[email protected]>
lifubang added a commit to lifubang/runc that referenced this pull request Oct 17, 2024
In opencontainers#3987(0e9a335), we may use a memfd to copy run to start runc init,
due to a Go stdlib bug, we need to add safeExe to the set of
ExtraFiles otherwise it is possible for the stdlib to clobber the fd
during forkAndExecInChild1 and replace it with some other file that
might be malicious. This is less than ideal (because the descriptor
will be non-O_CLOEXEC) however we have protections in "runc init" to
stop us from leaking extra file descriptors.
See <golang/go#61751>.

There is a race situation when we are opening this memfd, if the fd 5
or 6 was closed at that time, maybe it will be reused by memfd.

But because of we have added safeExe to the set of ExtraFiles, if the
fd of safeExe is than stdio fds count + ExtraFiles count, go stdlib will
dup3 it to another fd, then it will cause the original fd closed. (opencontainers#4294)

Signed-off-by: lfbzhm <[email protected]>
lifubang added a commit to lifubang/runc that referenced this pull request Oct 17, 2024
In opencontainers#3987(0e9a335), we may use a memfd to copy run to start runc init,
due to a Go stdlib bug, we need to add safeExe to the set of
ExtraFiles otherwise it is possible for the stdlib to clobber the fd
during forkAndExecInChild1 and replace it with some other file that
might be malicious. This is less than ideal (because the descriptor
will be non-O_CLOEXEC) however we have protections in "runc init" to
stop us from leaking extra file descriptors.
See <golang/go#61751>.

There is a race situation when we are opening this memfd, if the fd 5
or 6 was closed at that time, maybe it will be reused by memfd.

But because of we have added safeExe to the set of ExtraFiles, if the
fd of safeExe is than stdio fds count + ExtraFiles count, go stdlib will
dup3 it to another fd, then it will cause the original fd closed. (opencontainers#4294)

Signed-off-by: lfbzhm <[email protected]>
lifubang added a commit to lifubang/runc that referenced this pull request Oct 17, 2024
In opencontainers#3987(0e9a335), we may use a memfd to copy run to start runc init,
due to a Go stdlib bug, we need to add safeExe to the set of
ExtraFiles otherwise it is possible for the stdlib to clobber the fd
during forkAndExecInChild1 and replace it with some other file that
might be malicious. This is less than ideal (because the descriptor
will be non-O_CLOEXEC) however we have protections in "runc init" to
stop us from leaking extra file descriptors.
See <golang/go#61751>.

There is a race situation when we are opening this memfd, if the fd 6
or 7 was closed at that time, maybe it will be reused by memfd.

But because of we have added safeExe to the set of ExtraFiles, if the
fd of safeExe is not bigger than stdio fds count + ExtraFiles count, go
stdlib will dup3 it to another fd, then it will cause the original fd
closed. (opencontainers#4294)

Signed-off-by: lfbzhm <[email protected]>
lifubang added a commit to lifubang/runc that referenced this pull request Oct 18, 2024
In opencontainers#3987(0e9a335), we may use a memfd to copy run to start runc init,
due to a Go stdlib bug, we need to add safeExe to the set of
ExtraFiles otherwise it is possible for the stdlib to clobber the fd
during forkAndExecInChild1 and replace it with some other file that
might be malicious. This is less than ideal (because the descriptor
will be non-O_CLOEXEC) however we have protections in "runc init" to
stop us from leaking extra file descriptors.
See <golang/go#61751>.

There is a race situation when we are opening this memfd, if the fd 6
or 7 was closed at that time, maybe it will be reused by memfd.

Because we want to add safeExe to the set of ExtraFiles, if the fd of
safeExe is too small, go stdlib will dup3 it to another fd, or dup3 a
other fd to this fd, then it will cause the fd type cmd.Path refers to
a random path. (issue: opencontainers#4294)

Signed-off-by: lfbzhm <[email protected]>
lifubang added a commit to lifubang/runc that referenced this pull request Oct 18, 2024
This reverts commit 65a1074.

we need (opencontainers#4020) because of (opencontainers#3931), at that time, we removed the bindfd logic, and the
memfd logic will use more memory than before, but we have not yet moved binary clone from
runc init to runc parent process, so we need to increase memory limit in CI.
As we have moved the runc binary clone logic from runc init to runc parent process in
(opencontainers#3987), so the memory usage of binary clone will not be included in container's memory
cgroup accounting. Now we can support run a simple container with lower memory usage the
same as before.

Signed-off-by: lfbzhm <[email protected]>
lifubang added a commit to lifubang/runc that referenced this pull request Oct 18, 2024
In opencontainers#3987(0e9a335), we may use a memfd to copy run to start runc init,
due to a Go stdlib bug, we need to add safeExe to the set of
ExtraFiles otherwise it is possible for the stdlib to clobber the fd
during forkAndExecInChild1 and replace it with some other file that
might be malicious. This is less than ideal (because the descriptor
will be non-O_CLOEXEC) however we have protections in "runc init" to
stop us from leaking extra file descriptors.
See <golang/go#61751>.

There is a race situation when we are opening this memfd, if the fd 6
or 7 was closed at that time, maybe it will be reused by memfd.

Because we want to add safeExe to the set of ExtraFiles, if the fd of
safeExe is too small, go stdlib will dup3 it to another fd, or dup3 a
other fd to this fd, then it will cause the fd type cmd.Path refers to
a random path. (issue: opencontainers#4294)

Signed-off-by: lfbzhm <[email protected]>
lifubang added a commit to lifubang/runc that referenced this pull request Oct 18, 2024
This reverts commit 65a1074.

we need (opencontainers#4020) because of (opencontainers#3931), at that time, we removed the bindfd logic, and the
memfd logic will use more memory than before, but we have not yet moved binary clone from
runc init to runc parent process, so we need to increase memory limit in CI.
As we have moved the runc binary clone logic from runc init to runc parent process in
(opencontainers#3987), so the memory usage of binary clone will not be included in container's memory
cgroup accounting. Now we can support run a simple container with lower memory usage the
same as before.

Signed-off-by: lfbzhm <[email protected]>
kolyshkin pushed a commit to lifubang/runc that referenced this pull request Oct 21, 2024
This reverts commit 65a1074.

we need (opencontainers#4020) because of (opencontainers#3931), at that time, we removed the bindfd logic, and the
memfd logic will use more memory than before, but we have not yet moved binary clone from
runc init to runc parent process, so we need to increase memory limit in CI.
As we have moved the runc binary clone logic from runc init to runc parent process in
(opencontainers#3987), so the memory usage of binary clone will not be included in container's memory
cgroup accounting. Now we can support run a simple container with lower memory usage the
same as before.

Signed-off-by: lfbzhm <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
9 participants